Skip to content

BUG: AttributeError when read_sas used with GCS #33070

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Apr 7, 2020

Conversation

tswast
Copy link
Contributor

@tswast tswast commented Mar 27, 2020

With GCSFS (and when a binary file buffer is passed in), the output from
the file read is bytes, not a Unicode str. The encode call is unnecessary
in this case.

With GCSFS (and possibly other connectors), the output from they file
read is `bytes`, not a Unicode `str`. The `encode` call is unnecessary
in this case.
@tswast tswast force-pushed the issue33069-read_sas-gcsfs branch from 8f760d2 to f31b6b2 Compare March 27, 2020 15:30
@tswast tswast marked this pull request as ready for review March 27, 2020 15:30
@tswast
Copy link
Contributor Author

tswast commented Mar 27, 2020

Not sure why "20.00% of diff hit (target 50.00%)", since I added a test and it's only one additional line of code. Maybe needs another test in text-mode?

Edit: None of the available XPT data files can be opened as a text file. It seems the UnicodeEncodeError passthrough may no longer be required at all with Python 3.

@WillAyd WillAyd added this to the 1.1 milestone Mar 27, 2020
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@jreback jreback added the IO SAS SAS: read_sas label Mar 29, 2020
@jreback
Copy link
Contributor

jreback commented Mar 29, 2020

@bashtage if you'd have a look.

@simonjayhawkins
Copy link
Member

@tswast can you merge upstream master to resolve conflicts.

@@ -265,10 +265,6 @@ def __init__(
else:
# Copy to BytesIO, and ensure no encoding
contents = filepath_or_buffer.read()
try:
contents = contents.encode(self._encoding)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the encoding every applied to contents after this change? Does this matter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The encoding was only to get from text-mode to binary mode. As far as I can tell, SAS XPORT files don't have a text-mode, so the encoding is unnecessary in Python 3.

Later on, bytes corresponding to text columns are decoded into strings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks right.

try:
contents = contents.encode(self._encoding)
except UnicodeEncodeError:
pass
self.filepath_or_buffer = BytesIO(contents)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to determine if it is a binary file handle already, so that one can avoid copying the file to a BytesIO and the having to reread this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the conversion to BytesIO is always unnecessary on Python 3. Removed this conversion and added a comment with my thoughts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per https://www.loc.gov/preservation/digital/formats/fdd/fdd000464.shtml, rows are padded with Null bytes, so I think reading an XPORT file in text-mode on Python 3 will always fail.

@tswast tswast requested a review from bashtage April 2, 2020 14:50
@bashtage
Copy link
Contributor

bashtage commented Apr 2, 2020

Linting

Copy link
Contributor

@bashtage bashtage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reduced to minimal version it seems. LGTM, once green (linting)

@@ -265,10 +265,6 @@ def __init__(
else:
# Copy to BytesIO, and ensure no encoding
contents = filepath_or_buffer.read()
try:
contents = contents.encode(self._encoding)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks right.

@mroeschke mroeschke merged commit c82cb17 into pandas-dev:master Apr 7, 2020
@mroeschke
Copy link
Member

Thanks @tswast

@tswast tswast deleted the issue33069-read_sas-gcsfs branch April 7, 2020 14:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO SAS SAS: read_sas
Projects
None yet
Development

Successfully merging this pull request may close these issues.

read_sas fails when passed a file object from GCSFS
6 participants